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IV 


1.  Introduction 


This  technical  note  provides  a  brief  description  of  a  Java  library  for  Arabic  (and  also  English) 
natural  language  processing  (NLP),  containing  code  for  training  and  applying  the  Arabic  NLP 
system  described  in  Stephen  Tratz’s  2013  paper  “A  Cross-Task  Llexible  Transition  Model  for 
Arabic  Tokenization,  Affix  Detection,  Affix  Labeling,  POS  Tagging,  and  Dependency  Parsing”  in 
the  2013  Statistical  Parsing  of  Morphologically  Rich  Languages  (SPMRL)  workshop  (see  section 
9  for  all  relevant  publications). 

The  system  is  capable  of  clitic  separation,  inflectional  affix  identification  and  labeling, 
part-of-speech  tagging,  and  dependency  parsing  for  Arabic.  Lor  more  information  on  these 
topics  and  the  overall  system,  the  reader  is  referred  to  the  SPMRL  2013  publication. 

Lor  historical  reasons  (section  7),  the  library  also  includes  components  for  English  part-of-speech 
tagging,  dependency  parsing,  preposition  sense  disambiguation,  noun  compound  interpretation, 
and  possessives  interpretation.  Much  of  the  code  is  relevant  to  general  purpose  NLP  applications 
(e.g.,  the  WordNet  interface)  and  is  not  specific  to  either  Arabic  or  English. 


2.  File  Overview 


The  following  is  a  short  description  of  the  top  level  files  and  directories. 

build/  -  directory  with  Ant  build  scripts  for  running  dependency  converters,  training  the  various 
subsystems,  and  running  some  experiments 

conf/  -  directory  with  files  related  to  Arabic  constituent-to-dependency  conversion  as  well  as 
the  English  semantic  annotation  modules 
data/  -  directory  with  a  variety  of  data  used  by  different  submodules 
docs/  -  directory  with  additional  documentation 
examples/  -  a  directory  with  some  example  input  and  output  files 
jar/  -  a  directory  with  a  precompiled  copy  of  this  code 
lib/  -  a  directory  for  any  required  .jar  files 

scripts/  -  a  directory  with  some  scripts  for  applying  trained  models  to  new  data 
src/  -  the  source  code 

README  -  a  README  file  giving  an  overview  similar  to  this  technical  note 
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LICENSE  -  the  license  file 
finegrainconverter-0.2.3.jar 

pennconverter_modified_Spring20 1 2.j  ar 


-  English  dependency  converter  jar  from 
http://sourceforge.net/projects/miacp/ 

-  English  dependency  converter  jar  from 
http://sourceforge.net/projects/miacp/ 

(modified  from  the  pennconverter 
http://nlp.cs.lth.se/software/treebank_converter/) 


3.  Code  Compilation 


The  code  is  typically  compiled  using  Apache  Ant,  freely  available  software  typically  used  for 
building  software  projects.  To  download,  or  for  more  information,  see  http://ant.apache.org/. 

A  variety  of  ant  build  XME  files  are  contained  in  the  build  subdirectory.  To  create  a  jar  file 
containing  the  compiled  code,  switch  to  INSTAEEATION_DIR/build  and  type  “ant  -f 
build_common.xml  makeJar”. 


4.  Training  Instructions 


Eor  the  Arabic  system,  see  doc/AR_TRAINING_INSTRUCTIONS. 
Eor  the  English  system,  see  doc/EN_TRAINING_INSTRUCTIONS  . 


5.  Applying  the  System  to  New  Examples 


Example  scripts  for  applying  a  trained  model  to  new  data  are  provided  under  the  scripts 
subdirectory.  Short  example  texts  are  provided  in  the  examples  subdirectory. 
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6.  License 


This  project  constitutes  a  work  of  the  United  States  Government  and  is  not  subject  to  domestic 
copyright  protection  under  17  USC  Ag  105. 

The  project  includes  and  derives  from  code  licensed  under  the  terms  of  the  Apache  2.0  License 
and  is,  therefore,  licensed  under  the  Apache  2.0  License.  See  the  doc/LICENSE-2.0  file  for  more 
information.  This  file  may  also  be  found  at  http://www.apache.Org/licenses/EICENSE-2.0. 

A  copy  of  WordNet  3.0  (http://wordnet.princeton.edu/wordnet/),  a  freely  available  lexical 
database  for  English,  is  included  with  this  project.  Its  license  is  provided  in  the 
‘doc/WORDNET_EICENSE’  file. 

A  modified  version  of  the  pennconverter,  freely  available  constituent-to-dependency  conversion 
software  described  by  Richard  Johansson  and  Pierre  Nugues  in  their  paper  “Extended 
Constituent-to-dependency  Conversion  for  English,”  in  the  proceedings  of  the  2007  Nordic 
Conference  on  Computational  Einguistics,  is  included  with  this  project.  Its  license  is  provided  in 
the  ‘doc/PENNCONVERTER_EICENSE’  file. 


7.  History 


This  project  is  derived  from  a  collection  of  code  written  as  part  of  my  graduate  student/thesis 
work  at  USC/ISI  that  was  publicly  released  under  the  Apache  2.0  license  in  late  201 1  at 
http://www.isi.edu/publications/licensed-sw/fanseparser/index.html. 

In  mid-2012,  a  project  derived  from  the  original  University  of  Southern  California/Information 
Sciences  Institute  version  of  the  software  was  released  at  http://sourceforge.net/project  s/miacp/. 
This  newer  version  used  a  slightly  different  English  dependency  scheme  and  contained  a  variety 
of  improvements.  However,  the  PropBank- style  SRE  module  was  not  maintained. 

This  current  project  was  derived  from  the  http://sourceforge.net/projects/miacp/release  by  NEP 
researchers  at  the  U.S.  Army  Research  Eaboratory.  A  significant  portion  of  the  code  has  been 
replaced  with  newer  code,  deleted,  or  otherwise  revised.  The  focus  of  this  effort  was  to  create  the 
system  described  in  the  SPMRE  2013  paper  mentioned  in  section  9.  However,  the  English  POS 
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tagger,  parser,  preposition  sense  disambiguator,  noun  eompound  relation  disambiguator,  and 
possessive  relation  disambiguator  are  still  funetional.  (The  English  eomponents  may  be  slightly 
slower  and  slightly  less  aeeurate  than  those  in  the  http://soureeforge.net/projeets/miaep/release.) 


8.  Important  Note 


This  release  eontains  a  variety  of  bug  fixes  and  other  generally  small  ehanges  that  have  been  made 
sinee  the  SPMRL  2013  paper  publieation  and,  as  sueh,  results  will  not  be  identieal — in  general, 
the  results  will  be  insignifieantly  better,  but  there  may  be  some  eases  where  performanee  is  worse. 


9.  Papers  to  Cite 


Authors  wishing  to  eite  to  this  work  should  refer  to  the  following  papers. 

Arabic 

Tratz,  Stephen.  2013.  A  Cross-Task  Flexible  Transition  Model  for  Arabic  Tokenization,  Affix 
Detection,  Affix  Labeling,  POS  Tagging,  and  Dependency  Parsing.  Fourth  Workshop  on 
Statistieal  Parsing  of  Morphologieally  Rieh  Languages  (SPMRL). 

English 

(Note:  These  are  for  earlier  versions  of  the  English  system/eomponents  and,  thus,  it  may  make 
more  sense  to  eite  one  or  more  of  these  along  with  the  2013  SPMRL  paper.) 

Tratz,  Stephen.  2011.  Ph.D.  Thesis.  Semantieally-Enriehed  Parsing  for  Natural  Language 
Understanding.  University  of  Southern  California. 

Dependeney  Parsing 

Tratz,  Stephen,  and  Eduard  Hovy.  2011.  A  Fast,  Accurate,  Non-projective, 

Semantically -Enriched  Parser.  Proeeedings  of  the  Conferenee  on  Empirieal  Methods  in 
Natural  Language  Proeessing  (EMNLP). 
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Preposition  Sense  Disambiguation 

Tratz,  Stephen,  and  Dirk  Hovy.  2009.  Disambiguation  of  Preposition  Sense  Using  Linguistically 
Motivated  Features.  Proceedings  of  Human  Language  Technologies:  The  2009  Annual 
Conference  of  the  North  American  Chapter  of  the  Association  for  Computational  Linguistics 
(NAACL),  Companion  Volume:  Student  Research  Workshop  and  Doctoral  Consortium. 

Hovy,  Dirk,  Stephen  Tratz,  and  Eduard  Hovy.  2010.  What’s  in  a  Preposition?:  Dimensions  of 
Sense  Disambiguation  for  an  Interesting  Word  Class.  Proceedings  of  the  23rd  International 
Conference  on  Computational  Linguistics  (COLING):  Posters. 

Noun  Compound  Interpretation 

Tratz,  Stephen,  and  Eduard  Hovy.  2010.  A  Taxonomy,  Dataset,  and  Classifier  for  Automatic 
Noun  Compound  Interpretation.  Proceedings  of  the  48th  Annual  Meeting  of  the  Association 
for  Computational  Einguistics  (ACE). 

Possessives  Interpretation 

Tratz,  Stephen,  and  Eduard  Hovy.  2013.  Automatic  Interpretation  of  the  English  Possessive. 
Proceedings  of  the  5 1st  Annual  Meeting  of  the  Association  for  Computational  Einguistics 
(ACE). 
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